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TITLE OF INVENTION 
Method for Predicting Prognosis and Treatment Response 
for Genetically Based Abnormalities 
BACKGROUND OF THE INVENTION 

5 The present invention relates to a method by which one can uncover 

patient and disease differentiations based on novel combinations of clinical 
and genotypical data (e.g., involving specific exon mutations within specific 
genes) which differentiate the patient population in ways that significantly 
enhance the ability to assess the prognosis and select suitable modes of 

1 0 treatment for the disease. In the most simple and basic model of man and 
disease, a single etiology such as an inherited genetic abnormality which, as 
a result, leads to a specific clinical manifestation reflective of the abnormality. 
For each such disease and its unique etiology, perhaps a specific treatment is 
available, and an associated prognosis possible. In many cases, such a 

1 5 simple model of a medical disorder is the exception rather than the rule. 

However, in areas in which we have inadequate understanding, this model 
may still be the assumption with which physicians must work until further 
differentiation is possible. 

The publications and other materials used herein to illuminate the 

20 background of the invention or provide additional details respecting the 

practice, are incorporated by reference, and for convenience are respectively 
grouped in the appended List of References. 

SUMMARY OF THE INVENTION 
The present invention relates to a method by which one can uncover 

25 patient and disease differentiations based on novel combinations of clinical 
and genotypical data which differentiate the patient population in ways that 



t 

I 
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significantly enhance the ability to assess the prognosis and select suitable 
modes of treatment for the d In the method of the present invention, the 
process begins with a table of patient data that characterizes patients' 
medical clinical history in terms of specific gene's exon mutations or important 
5 polymorphism that patients have been found to carry, plus other clinical data, 
isease. 

Given a selection of clinical data including (a) some particular disease 
state and (b) one or more patient clinical data factors the first goal (stage I) is 

10 to characterize (i.e., predict) the selected patient clinical data for the particular 
disease state chosen in terms of the presence of specific characteristics of 
specific genes. The second goal (stage II) is to take these identified 
characterizations derived in stage I and use them to differentiate the patient 
population expressing the chosen disease state in terms of the prognosis of 

1 5 the particular disease and in the efficaciousness of specific treatment 
modalities. 

In summary, the method of the present invention takes clinically 
observable, collectible data from selected patient populations with diseases 
arising from genetic abnormalities, and uses such information to differentiate 
20 classes of patients with regard to specific differences in their genetic makeup, 
which in turn then characterizes outcomes such as prognosis and treatment 
response. 

BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a flow-chart showing a method of the present invention. 
25 DETAILED DESCRIPTION OF THE INVENTION 

In the method of the present invention, the process begins with a table 
of patient data that characterizes patients' medical clinical history in terms of 
specific gene's exon mutations or important polymorphism that patients have 
been found to carry, plus other clinical data which include but are not limited 
30 to: 1) any medical diagnoses they have and medical diagnoses their 
pertinent relatives have, particularly including first and second degree 
relatives (i.e., father and mother, siblings, and children, aunts, uncles, and 
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grandparents on both sides of their family); 2) ages of onset of such diseases; 
3) associated medical and other conditions such as presence of colon polyps 
or high blood pressure; 4) ethnic origin; 5) race; 6) sex; 7) habits such as 
smoking, exposure to toxic chemicals; 8) occupations carrying medical risks 

5 such as agricultural workers; and 9) other additional information which may be 
available or obtainable in a patient's and patient's relatives' medical chart or 
file. For example, additional information for each patient might also include 
any treatments received, the outcome(s) of the treatment(s), ongoing patient 
management issues such as "time until first re-occurrence of the disease," 

10 etc. 

Given a selection of clinical data including (a) some particular disease 
state and (b) one or more patient clinical data factors (such as "a significantly 
earlier onset" of that specific disease than is typical of the general population), 
the first goal (stage I) is to characterize (i.e., predict) the selected patient 

1 5 clinical data for the particular disease state chosen in terms of the presence 
of specific characteristics of specific genes, for example, specific mutated 
exons and/or over-expression. The second goal (stage II) is to take these 
identified characterizations derived in stage I and use them to differentiate the 
patient population expressing the chosen disease state in terms of the 

20 prognosis of the particular disease and in the efficaciousness of specific 
treatment modalities. 

In summary, the method of the present invention takes clinically 
observable, collectible data from selected patient populations with diseases 
arising from genetic abnormalities, and uses such information to differentiate 

25 classes of patients with regard to specific differences in their genetic makeup, 
which in turn then characterizes outcomes such as prognosis and treatment 
response. 

DEFINITIONS 

In order to provide an understanding of several of the terms used in the 
30 specification and claims, the following definitions are provided: 

"Data Mining Software" is widely available in the marketplace from 
numerous vendors. The technique and methodology essentially analyzes a 
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set of information, presented typically in a matrix array, in which each row of 
the matrix represents an example or instance of the phenomenon of interest 
and each cell within the row (i.e., column of the matrix) represents some 
descriptor of the phenomenon. For example, each row may represent a 

5 patient, with the first column containing a code indicating which disease (if 
any) the patient has had, the second column indicating what age he or she 
had the disease, the third indicating whether the patient is a smoker ("yes" or 
"no" value), etc. For certain rows (in our example, viewed as patients), these 
rows are marked "true" examples of any phenomenon of interest on the part 

10 of the user of the software (e.g., certain patients are identified as true cases 
of "hereditary breast cancer," or true cases of "patients who have survived a 
certain disease five or more years," or true cases of patients "who do not buy 
life insurance," etc.). 

Data mining software uses computational techniques based on the 

1 5 mathematical theory of rough sets to select constituent elements (column 

values) for each row so that one or more boolean logical relationships among 
these elements characterizes (i.e., predicts) all the "true" examples that were 
marked, as noted above. One can select which column values the software 
may utilize to construct possible rules. For example, if the rows, viewed at 

20 patients, were marked for the "true" hereditary breast cancer cases and just 
exon mutations are specified as the allowable descriptive elements from the 
database, one boolean logical relationship (or "rule") the software could 
uncover might be that if there is a "yes" in the column marked "had a positive 
BRCA1 "gene test" or a "yes" in the column marked "had a positive BRCA2 

25 gene test, " then any patient meeting this rule (A or B) is a member of the 
"true" set of hereditary breast cancer persons. If all medical factors in the 
database were allowed, another rule the software might develop could be: If 
a patient had beast cancer, the age of onset was before age 35, and there 
were two or more first degree relatives similarly affected, than the patient is 

30 one of the "true" hereditary breast cancer persons. Such a rule would be of 
the form (X and Y and Z). Through an efficient yet exhaustive process that is 
complete and systematic, data mining algorithms can find such valid rules or 
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patterns that can be constructed so that one or more of the "true" cases are 
characterized by that rule. The set of all such rules taken together define or 
characterize the set of "true" patterns provided, and as such, may be thought 
of as an expert rule-based system which defines or characterizes the 

5 patterned "true" set selected. Moreover, since the process examines the 
actual "true" cases to calculate the rules, its output can specify which cases 
are valid examples of which rules. 

One off-the-shelf data mining product we have used in a part of our 
invention is DataLogic/R+1.5 from Reduct Systems, Inc., Regina, 

10 Saskatchewan, Canada, although numerous software choices and vendors 
are available. 

"Disease" includes but is not necessarily restricted to any measurable 
condition of an individual that is not normal in some statistically probabilistic 
sense for some subset of the population, any malfunction of an individual or 

1 5 dysfunctional condition or genomic defect for a subset of the population, a 
genomic abnormality such as a mutation in a gene or identifiable 
polymorphism or other significant abnormality in part of the genome, or a 
genomic variation for some subset of the population which is measurably 
different from another subset, where such disease(s) can be manifest in 

20 behavioral changes, physical changes, tissue conditions, tissue or genomic- 
level conditions and changes, or other performance changes as is 
distinguishable from a different subset of the population. 

"Exon Mutation" Particular parts of the DNA structure, the genes, 
provide the blueprints for the construction of proteins, assembling each 

25 protein in steps. Each step is dictated by and active coding region of the 

gene, and exon, which is interrupted by spans of non-coding regions (introns). 
If one of these coding regions has a flaw and can not properly assist in its role 
in the formation of the protein under construction, then the flawed area 
represents an exon mutation. 

30 "Genotypical Data" includes but is not necessarily restricted to data 

about the genome of an individual, cellular factors or substances that 
influence or affect or alter the genome, measurable alterations in the genome, 
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data that reflect different levels of expression of the genome or parts of the 
genome or genome-related conditions, data that reflect the integrity or quality 
of the genome, or data that bear on the interrelationships between and 
among genomic elements of the cell or the cellular factors that influence the 

5 genomic elements. 

"Patient Clinical Data Factors" includes but is not necessarily 
restricted to data that can measure or represent the status of an individual, 
any disease of that individual, or characteristics of that individual which might 
be included in a health-related database, with such data to also include 

10 information about an individual that distinguishes the individual from others 

(e.g., age, race) which may be used to characterize the healthy as well as the 
diseased individual. 

"Prognosis" includes but is not necessarily restricted to possible 
outcomes, results, implications or consequences of disease, a subsequent 

1 5 state or status of a patient which could be psychological, behavioral, 
physiological, genomic-related, or other measurable difference as a 
consequence of disease or an abnormal condition. 

"Treatment Outcome" includes but is not necessarily restricted to 
outcomes or results as a consequence of the administration of a drug or other 

20 therapeutic substance, or the administration of a therapeutic regimen which 
may involve ingesting any substance, a physical procedure, a behavioral 
procedure, a surgical procedure, any procedure that may lead to an alteration 
in the individual's genome or genomic-related condition, or any other health- 
related action intended to alter the health status of an individual. A treatment 

25 outcome may also be a measurable condition of the individual that 

distinguishes one state from a prior one even without knowing what agent 
created the transformation. 

One method of the present invention includes the following steps: 
Step 1 The first step is to select a particular syndrome, disease or 

30 combination of diseases of interest, such as hereditary breast- 

ovarian cancer. Data are assembled within a database as an 
array of patient information about patients who have exhibited 



PCT/US98/18261 

7 

the selected disease or syndrome which includes for each 
patient, data regarding other specific diseases he or she may 
have had, diseases any first and second degree relatives have 
had as well as other immediate relatives, the age when any of 
the diseases occurred, any genetic mutations that have been 
determined, itemized by gene and by exons within the gene, 
and other clinical data which might be found in a clinical or 
medical record including treatments, drugs taken, outcome data 
with regard to treatment or the course of the disease for each 
patient as well as any relatives, etc. 
Second, one selects a particular disease, syndrome, or 
combination of diseases of interest d h such as hereditary breast- 
ovarian cancer. 

Next, one or more medical database elements (patient 
characterizations Cj) are chosen for patients with disease d h 
such as "had early onset of the disease" (for the example of 
hereditary breast-ovarian cancer this may be defined as the 
onset of the disease before the age of 40). The subset of 
selected patients who have the disease of interest and who 
exhibit these medical factor(s) chosen is defined to be the "true" 
examples (for the data mining software) of the pattern to be 
investigated. The subset of selected patients with the disease 
state but who do not exhibit the medical factors Cj is labeled as 
"false" examples of the pattern (for the data mining software). 
A similar set of patients with the selected disease is also chosen 
and the factor(s) chosen in the first step are negated {-C; they 
do not exhibit Cj), and the combination is considered the "true" 
patterns. In this example, patients with hereditary breast- 
ovarian cancer who were late-onset would be the "true" 
examples of the pattern. The subset of selected patients with 
the disease state but without exhibiting the negated factors is 
labeled as "false" examples of the pattern (for the data mining 
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software). 

Step 4a/b Data mining methods are applied using as the "true" and "false" 
patterns those identified in steps 3a and 3b. The logical 
elements of the database which are specified for use by the 
5 data mining method (i.e., the specific columns for each data 

entry row) are the data about specific exon mutations of specific 
genes in individual patients. For example, assume that we have 
in our database entries for any detected exon mutations for the 
genes BRCA1, BRCA2, MLH1, and MSH2 for patients with 

10 hereditary breast-ovarian cancer. Then all of these are marked 

for the data mining process as the allowable elements (i.e., 
column values) which may be used to define the true set 
designated first in step 3a, then in step 3b. 
Step 5a The output from the data mining software is a set of rules ("rule 

1 5 set") R[CJ that characterizes what logical mix of exon mutations 

of genes can be constructed so that the pattern is defined. 
These rules comprised of the occurrences of specific exon 
mutations "characterize" the designated "true" patterns. For our 
example, if one rule is "[Exon 2 of BRCA1 or Exon 3 of BRCA1]" 

20 arising in step 4a, then we know that if a patient exhibits the 

characteristics of having a mutation in either exon 2 or 3 of gene 
BRCA1 , then that patient will be an example of the pattern of a 
"true" case of "hereditary breast-ovarian cancer of early onset." 
Step 5b Output from data mining software for step 4b is a set of rules 

25 involving the exons for the negation of the medical condition 

under consideration, R[-C] t which is an analogous set of rules 
as arise in step 4a. 
Step 6 The two sets of rules in steps 5a and 5b are compared (as 
described below) to see if there is any logical overlap of 

30 constituent elements (i.e., exon mutations) comprising the two 

rule sets. There constitutes a logical overlap if one condition, 
such as [mutation of exon 2 of BRCA1], occurs as an element in 
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both a rule in the rule set for 5a as well as in a rule in the rule 
set for 5b. If no such logical overlap exists for at least one exon, 
then the medical factor(s) (in our example, the condition "early 
onset" versus "late onset") is said to be a splitting condition for 
the disease based on one or more of the elements of definition 
(i.e., one or more exon mutations). The two sets of exon 
mutations arising in rules in steps 5a and 5b are the tWQ splitting 
vectors (of exon mutations) for the splitting condition. The set of 
exons unique to each of the splitting vectors are called the 
splitting exons for the splitting vectors. It is ultimately such 
significant splitting conditions and their associated splitting 
vectors and splitting exon mutations for a disease state and 
associated medical factor(s) that we desire to obtain. This 
splitting condition means there are some differentiating clinical 
manifestation(s) for a disease state (in our example, "early" 
versus "late onset" for the disease hereditary breast-ovarian 
cancer) which can be reflected at the exon mutation level. This 
distinction of exon mutations in effect partitions the patient 
population with an otherwise apparently single umbrella disease 
(e.g., hereditary breast-ovarian cancer) into a potentially 
worthwhile, more refined division from a genetics point of view 
(such as, for example, "early onset hereditary breast-ovarian 
cancer" versus "late onset hereditary breast-ovarian cancer"). 
If no significant splitting condition is found, another attribute is 
selected by the program from among the database attributes 
within the database (such as "there are two first degree relatives 
with the same disease"). This attribute is similarly tested for its 
prospects as a splitting condition. If one is obtained, then the 
second stage (step 7) of analysis is entered. If not, each 
database attribute can be tested in a loop, one at a time, then 
two attributes taken together can be selected, and in a loop, 
tested as a splitting condition. Again, if none is found, three 
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attributes taken together may be selected, etc. With a high 
speed computer, this cycling automatically considers all 
combinations in hopes of finding a splitting characteristic or a 
combination of attributes which will constitute a splitting 
condition. 

Once a splitting condition is obtained, (together with one or 
more associated combinations of splitting exon mutations 
comprising the associated splitting vector), a prognosis or 
treatment outcome attribute T, and its negation -Tj are selected 
where T, * C;. In our example, for the disease state "hereditary 
breast-ovarian cancer," and the attribute "early onset," selected 
treatment and outcome parameters (such as "five years or more 
of remission" and "received radiation treatment") can be 
identified. Generally, T t can in fact be a boolean expression 
comprised of entries found in the database d r 
The first of the two set(s) of genetic exon mutations in the two 
splitting vectors from step 6 is used as the permissible 
characteristics to define the patterns for patient records which 
exhibit the prognosis or treatment outcome selected in step 7. 
For example, assume exons e 1f e 2 , and e 3 in gene BRCA1 are 
the characterizing exon mutations (the splitting exons for the 
first vector) for the disease state "hereditary breast-ovarian 
cancer," and the splitting condition "early onset." Then the 
constituent elements e 1f e 2 , and e 3 in the splitting vector 
represent the eligible characteristics that may be used with data 
mining software to define or characterize a prognosis or 
treatment outcome Tj for the disease condition and medical 
attribute(s) corresponding to the splitting condition (e.g., "early 
onset" and "hereditary breast-ovarian cancer"). As noted, data 
mining methodology is invoked, attempting to characterize these 
true patterns in terms of the exon set comprising the 
corresponding first splitting vectors. In our example, with the 
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exons e 1t e 2l and e 3 as the splitting set, this means one 
attempts to use these entries to characterize the "true" patients 
with their selected treatments and/or outcomes with boolean 
functions of the exon elements e 1f e 2l and e 3 . 
5 Step 8b As in step 8a f the exons in the first splitting vector is used to 
characterize the negative form of Tj. 
Step 8c Analogous to step 8a, the exons in the second splitting vector 

are used to characterize the "true" set defined to be the patterns 
with dj and -TV 

10 Step 8d As in step 8c, the exons in the second splitting vector are used 
to characterize the "true" patterns defined to be the patterns 
with d| and -Tj. 

Step 9 If step 8a-d is successful, the "true" cases are characterized by 
rule sets from the data mining. 

1 5 Step 10 If step 9 is successful for significantly characterizing the true set, 
then the corresponding constituent exons contained in each rule 
set are defined to be a successful two-stage splitting result (i.e., 
working for both stage I and stage II). One can derive for 
example the fact that if the patients who are early onset 

20 hereditary breast-ovarian cancer patients have certain exon 

mutations they are very responsive to radiation therapy, but late 
onset hereditary breast-ovarian cancer patients do not have 
such an exon mutation (but have their own particular set of 
exons) and thus are not responsive to radiation. It is noted that 

25 certain combination of results in step 10 are more desirable than 

others. 

Step 11 To cycle through the database, one returns to step 7 and other 
prognosis and treatment outcomes attributes are selected (e.g., 
all hereditary breast-ovarian patients with one year reoccurrence 
30 rates, all with two year, all who responded favorably to radiation, 

all who responded favorably to chemotherapy, etc.) one by one 
using the first two sets of exons in EJCJ and E 2 [-CJ. If a more 
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than two-stage level is desired, we employ all four sets of exons 
derived in step 9 and enter step 7 with two pairs of exon sets 
rather than just two sets. Then in step 8, each pair is treated as 
each prior single set was treated. One may continue such a 
5 process for as many stages as desired. In the general case, the 

constituent elements of the splitting vectors derived are used as 
the permissible elements to define logical rules to characterize 
the prognosis condition or treatment outcome conditions 
selected. Once each prognostic measure or treatment outcome 
10 combination is selected these attributes may be combined into 

more complex patterns and tested, taking first two attributes in 
combination, then three, etc (e.g., does the exon set [e1, e2, 
and e3] characterize five year remission cases treated with both 
radiation and chemotherapy). 
15 In selected cases, step 4a or 4b may be null, i.e., one-sided splitting 

conditions, obtaining no results, wherein 5a or 5b would be null. This is 
allowed since step 6 indicates that "one or both of [results] is not empty" (but 
it need not have been the case that both had to be "not empty"). The 
implications of this is that if only one "side" is non-empty, then that side is 
20 carried through in the remaining steps. Somewhat similarly, in step 7m, 
attribute Tj and -Tj is selected, and results obtained if possible in the four 
parts of step 8. Again, if one or more parts of step 8 are not productive in 
deriving results, the process continues, and any output derived is exhibited in 
step 9. Finally, if no Tj is even selected at step 7, the process by default goes 
25 through steps 8 and 9 in the null case, and terminates as it loops back to step 
7. 

These above 11 steps create exon-based defining patterns derived 
from clinical manifestations which will characterize prognosis and treatment 
outcome patterns for differentiated versions of disease conditions. If 
30 achieved, a logical link has been uncovered connecting clinical attributes 

(such as early onset or late onset cancer) with a set of aberrant exons which 
in turn characterize medical prognosis or treatment outcomes. 
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A further method of the present invention includes the following steps, 
in addition to the steps of the above method: 

In steps 3a and 3b of the method above, medical factors such as "early 
onset" are selected as interesting aspects of a medical disease (in this 
5 example, hereditary breast-ovarian cancer). Similarly in step 7, various 
prognosis or treatment outcomes are selected. In each case, exhaustive 
selection involving one or more combinations of factors may be chosen. 
Step 1 We apply data mining to correlate mutational outcomes with 

patient attributes in order to guide choices for steps 3a, 3b and 
10 step 7. In our heuristic module's method for steps 3a and 3b, 

the "true" pattern of interest is the clinical family history of the 
patients who have (a) the selected disease state and (b) who 
have tested positive for any specific intragenic mutation(s) in 
any gene(s). All of the exons may be used by the data mining 
15 software to define the "true" set. All other patient records are 

labeled as "false." In step 7, at which point we have a splitting 
vector with available splitting exons, condition (b) becomes a 
positive occurrence of any one or more of these splitting exons. 
Step 2 The data mining technology is then challenged to define what 
20 clinical medical attributes and what values for each attribute 

would define rules that characterized the disease state and 
mutations marked. This resulting set of rules might characterize 
the true set with such attributes such as "early onset" or "5 years 
of remission." All constituent elements comprising the rules 
25 heuristically suggest interesting choices for steps 3a, 3b or step 

7 respectively. 

The above additional two steps of the heuristic module yields choices 
for each of the two steps that may be more likely productive and informative 
than random selection alone. 
30 EXAMPLES 

The following examples are provided to further illustrate the present 
invention and are not intended to limit the invention beyond the limitations set 



WO 99/11822 PCI7US98/18261 

14 

forth in the appended claims and amendments. 

Example 1 

Clinical data are collected on family histories of patients with the 
selected disease of hereditary breast-ovarian cancer who manifested a 
5 variety of mutations in the BRCA1 and BRCA2 genes (which yield breast 
and/or ovarian cancers). First, patients are defined who exhibited a 5382- 
insertion C mutation in BRCA1 as true. Patients who carried some mutation 
other than 5382-insertion C were labeled as false. 

Using data mining technology to derive rules, the following results were 
10 obtained: 

1 . All rules contained the attribute "early age of onset" with very 
high values. 

2. Nearly all rules indicated a rather intense family tree of breast 
cancers, with typically more than 3 cases within one generation 

15 of each other. 

One of the actual rules takes the form of: 

Ca32a=4 AND 5< Early <6 AND Gen=2 
W here Ca32a is the code for first degree relatives with breast cancer (which in 
this rule must have at least four such cases). Early refers to early onset of 
20 breast cancer, where three points are assigned if any case arises at or before 
the age of 35, two points are assigned if a case arises between age 36 and 
age 45, and one point is assigned for a case arising between 46 and 50. Gen 
refers to the presence of cancers in the same generation, which for this rule 
requires at least two cases in the same generation. Using our heuristic 
25 module, any of these three attributes (Ca32a, Early, and Gen) would be 
suitable as a choice in steps 3a and 3b. 

Example 2 

Gatekeeper genes directly regulate the growth of tumors by inhibiting 
growth or promoting death. Predisposed individuals who inherit one mutant 
30 copy of a gatekeeper gene need only one additional (somatic) mutation to 
initiate neoplasia. Moreover because the probability of acquiring a single 
somatic mutation is exponentially larger than the probability of acquiring two 
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such mutations, people with a gatekeeper gene mutation are not only at 
greater risk of developing tumors than the general population but since only 
one more "hit" is needed, they can be expected to express such tumors 
earlier in life (early onset). 
5 Since we know that both BRCA1 and BRCA2 contain a region that act 

as a transcriptional-activation domain when it is fused to a DNA-binding 
domain from another gene (Milner, et al M 1997; Chapman, et al., 1996; 
Monteiro, et al., 1996), and transcription factors are often found among the 
gatekeeper class of cancer-susceptibility genes, this property indicates 

10 BRCA1 and BRCA2 may directly control cellular proliferation. Moreover 

BRCA1 can inhibit the growth of cells in which it is over-expressed (Holt, et 
al., 1996), and there is also a link between an inhibitor of cell-cycle-dependent 
kinase and BRCA1 protein (Hakem, et al., 1996). Thus there is reason to 
suspect BRCA genes of a gatekeeper function. From this gatekeeper status 

15 arises the expectation that BRCA mutations would lead to tumors much more 
frequently than the general population (i.e., higher risk), and also sooner than 
tumors arising usually (early onset). Both expectations are true for BRCA 
mutations. 

However, Sharan et al. and Scully et al. provide evidence that BRCA 
20 genes are caretaker genes. Mutations in caretaker genes do not promote 

tumor initiation directly. Rather neoplasia occurs indirectly, since inactivation 
leads to genetic instabilities which result in increased mutations in all genes 
including gatekeeper genes, which then as noted above, may progress rapidly 
to tumor genesis. So if a patient inherited a caretaker gene defect, three 
25 somatic mutations would be needed to initiate cancer: mutation of the normal 
caretaker allele inherited from the unaffected parent, followed by mutations of 
both alleles of a gatekeeper gene. Moreover mutations in caretaker genes 
would not be expected to lead to sporadic cancers very often since four 
mutations would have to occur (two caretaker alleles plus two gatekeeper 
30 alleles). BRCA genes might be added to the caretaker list since mutations in 
BRCA1 and BRCA2 are rarely found in sporadic cancers. Thus we have two 
contending roles for BRCA1 and BRCA2, gatekeeper and caretaker. 
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Before proposing a resolution of this issue, we may take note of the 
fact that both BRCA genes bind to Rad51 , a protein that is involved in 
maintaining the integrity of the genome (Sharan, et al., 1997; Scully, et al., 
1997). Moreover Sharan et al. report that BRCA2 knockout mice show early 
5 embryonic lethality and hypersensitivity to irradiation, similar to that observed 
in RAD51 knockout mice. Earlier BRCA1 was found to bind to RAD51 
(Scully, et al., 1997), and BRCA1 knockout mice also show early embryonic 
lethality (Hakem, et al., 1996; Liu, et al. 1996). We add the fact that studies 
in both yeast and mammalian cells indicate that RAD51 is involved in 

10 resolving double-stranded DNA breaks in recombination-linked repair, and 
thus we can expect that disruption of the BRCA/Rad51 pathway might be 
expected to lead to genetic instability. Consistent with this interpretation is 
the observed hypersensitivity to irradiation in embryonic and trophobiast cells 
from BRCA2 and Rad51 knockout mice, Sharan et al., 1997; Lim, et al., 1996. 

15 Putting all of this together, we could postulate two classes of exon 

mutations in BRCA genes: (1) exon mutations related to gatekeeper functions 
but unrelated to genomic stability regions such as the Rad51 binding site 
would disrupt the gatekeeper status of the BRCA genes, yielding autosomal 
dominant early onset breast cancer, and (2) mutations unrelated to the 

20 gatekeeper status but harmful to genomic stability regions such as the Rad51 
binding site disrupt the caretaker status of the BRCA genes, yielding late 
onset breast cancer. In summary, different exon mutations can affect either 
caretaker or gatekeeper-like functions in BRCA genes. Thus radiation of 
early onset breast cancer might be more efficacious since genomic instability 

25 arising from a compromised DNA caretaker role is not the primary issue, but 
conversely, radiation might be far less efficacious in late onset where the 
caretaker role of the gene and genomic instability (as in the knockout mice) is 
of significance. 

Broadly speaking, this discussion demonstrates that there is a 
30 reasonable prospect that there is a connection to be uncovered between 
clinical features (such as early versus late onset) and exon positions and 
concomitant prognosis or treatment outcomes. 



WO 99/11822 PCT/US98/18261 

17 

The exons involved in all BRCA1 patients who were 50 years or older, 
with no relatives of these patients under 40 were examined. The exons that 
characterized or predict these cases are: 5272; 300C; exon 5 missing 

Then we looked at all BRCA1 patients who expressed cancer under 
5 the age of 40, and who had no relatives over 50 with breast cancer. The 

exons predicting this group which did not overlap in any way the first set were: 
332 11T;188del11;4713intC. 

Thus early versus late stage onset (as described by the example) 
constitutes a total splitting condition, and the exons found would distinguish 
10 early from late onset. Prognosis and treatment outcomes were then 

characterized from these two different sets of exon mutations, completing the 
application of the method. In this example, the prognosis condition "P" 
indicating high penetrance of the cancer (many generations), determined how 
the two sets of exons characterized P and -P. The output was 4 rule sets, 
1 5 which constitute the results of the process. 

Example 3 

Although specific exon mutations within specific genes may be utilized 
for the genotypical (i.e., genome-related) data, our method is not exclusively 
or restrictively constrained just to exon mutations, this merely being one type 
20 of genomic-related data, but rather genotypical data of other formulations as 
well. 

Specification of disease in the method of the present invention may not 
be merely a traditional labeling such as pneumonia or diabetes but may in 
fact be dysfunctional conditions abnormal conditions as exhibited in tissue, or 

25 otherwise some abnormalities that can be contrasted with a baseline 

normality. The abnormality may even as yet have no formal name associated 
with it in medical science, and may be for example an over-expression of 
some gene for which medical science is not certain what to call or label the 
abnormality other than to contrast it to a baseline normal condition. 

30 Concomitantly, a prognosis may include a specific outcome or 

condition or state for a patient, and also may include a level of physiological 
status (measured by laboratory values), or a genomic-related status such as a 
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measure of the degree of over-expression of a gene which can characterize 
an organism. As yet another variation, the outcomes in step 5a and 5b may 
not be necessarily so simple as the absence or presence of just an exon, but 
also may be the absence or presence of any factor used in the database 

5 which was stated to be comprised of clinical and genotypical data. 

Moreover, it is also the case that the outcome in steps 5a and 5b may 
not be merely so simple as the presence or absence of a data element but 
also may be a function involving a data element such as element e1 is less 
than or equal to a certain value, greater than or equal to a certain value, or in 

10 general, any arithmetic function of the data element. Such variations of 

genotypical descriptions, disease states, prognosis states, and presence or 
value of a data element are all within the purview of the method's arena of 
application, although these variations are not inclusive of all variations, but 
only articulate some of the additional possible variations. 

15 in this example, we describe such a variation of the type of genotypical 

data that can be used in applying this method, variations on the range of 
disease or abnormal situations to which the method may apply, variations on 
the prognosis states that may be selected in using this method, and variations 
in the data elements that comprise the set of rules from the data mining which 

20 characterize the data set. 

To begin, specific over-expressed gene for a specific tissue is selected 
as the "abnormal" state (i.e., the "disease" for the method). As the medical 
factor Ci of interest, a specific range for the over-expression of the gene as 
the medical factor of interest is selected, with the specific range as measured 

25 by some laboratory method. The database includes the specific ranges of 

expression of this gene for each patient plus clinical data to describe patients, 
typical of data to be found in any comprehensive medical record. Thus with 
this selection, step 3a becomes: the true set are those patients with the over- 
expressed gene (their abnormal condition) with the value of the over- 

30 expression within some specifically designated range (as the medical factor of 
interest) along with their associated clinical and genotypical descriptors which 
may include exon mutation data, as well as also include other medical 
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descriptors as well. 

In this example, step 3b constitutes the contrasting group of what 
amounts to those patients who do NOT have the same range of the gene 
over-expression, but otherwise are the same. Working with this example, we 
5 can derive by applying step 4a/b what characterizes the two different groups, 
in step 3a and 3b. The items that characterize the two groups can be the 
presence or absence of descriptor factors or arithmetic measures of elements 
such as "patient's age is less than some value." 

Thus one could use this approach when we have different levels of 

1 0 gene over-expression for some tissue set and we want to characterize this 
level of over-expression from the non-similar levels of over-expressors. The 
splitting condition becomes those elements that cleanly characterize one 
group from the other. If there exists a further outcome or prognosis as per 
step 7, the method steps are continued, however if none exists, we can, by 

15 default select none, default to step 9, which defaults to step 7 again under the 
condition "none remaining," which ends the process. 

While the invention has been disclosed in this patent application by 
reference to the details of preferred embodiments of the invention, it is to be 
understood that the disclosure is intended in an illustrative rather than in a 

20 limiting sense, as it is contemplated that modifications will readily occur to 
those skilled in the art, within the spirit of the invention and the scope of the 
appended claims. 

Example 4 

The method of the present invention can also be applied where a 
25 disease d t is a medical condition such as stroke, the clinical attribute Cj is a 
drug X is given to the patient, and its negation -C, is drug X not given (i.e., a 
placebo or null drug is given), and the rest of the characterizing elements are 
a mixture of clinical data and genotypical data. Steps 4a and 4b, 5a and 5b, 
and step 6 are then followed. In step 7 the outcome T, selected is "a positive 
30 response to drug X as measured by the results of a trial for drug X" while its 
negation is "no positive response by the results of a trial for drug X." Hence 
step 8 characterizes the drug-taking positive responders, the placebo 
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positive-responders, and characterize the drug-taking non-responders and the 
placebo-taking non-responders. 

Example 5 

The method of the present invention can further be applied where a 
5 disease to be abnormal tissue of some type, the attributes C t is a series of 
genes differentially expressed for the abnormal tissue, and then characterize 
the tissue in step 4 with the set of said genes in terms of clinical and 
genotypical data available for the tissue. In step 7, the treatment outcome 
can be measured in terms of some transformation of the gene series selected 
10 as Ci. Thus, step 8 obtains the data elements which characterize the 

abnormal tissue with a baseline genetic characterization that has undergone 
a genetic transformation, and that has failed to undergo a genetic 
transformation. One can also obtain the characterization of the tissue which 
did not exhibit the genetic series C, but which underwent a transformation as 
15 well as did not undergo a transformation. 

REFERENCES 

1. Milner, J. et al. Nature 386, 772-773 (1997) 

2. Sharan et al. Nature 386, 804-810 (1997) 

3. Chapman, M.S. et al. Nature 382, 678-679 (1 996) 

20 4. Monteiro, A.N. et al. Proc. Natl Acad. Sci USA 93, 1 3595-1 3599 (1 996) 

5. Holt, J.T. et al. Nature Genet. 12, 298-302 (1996) 

6. Hakem, R. et al. Cell 85, 1 009-1 023 (1 996) 

7. Scully, R. et al. Cell 88, 265-275 (1 997) 

8. Liu, C.Y. et al. Genes Dev. 10, 1835-1843 (1996) 
25 9. Lim, D.S. et al. Mol. Cell Biol. 16, 7133-7143 (1996) 



WO 99/11822 



21 



PCT/US98/18261 



CLAIMS 

I claim: 

1 . A method for predicting the prognosis and a treatment response 
for genetically based abnormalities, comprising the steps of: 
5 selecting a disease for analysis; 

selecting a disease factor of said disease for analysis; 
selecting a genetic abnormality of said disease for analysis; 
marking said genetic abnormality as a first splitting factor if said 
genetic abnormality is existing in persons exhibiting said disease factor 
10 and not existing in persons not exhibiting said disease factor; 

selecting a treatment option for said disease factor; and 
marking said treatment option as a second splitting factor if a 
desired treatment outcome is existing in persons who received said 
treatment option and not existing in persons who did not receive said 
15 treatment option. 
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